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Abstract 

The problem of assessing the performance of algorithms used for the minimization 
of an ^i-penalized least-squares functional, for a range of penalty parameters, is in- 
vestigated. A criterion that uses the idea of 'approximation isochrones' is introduced. 
Five different iterative minimization algorithms are tested and compared, as well as 
two warm-start strategies. Both well-conditioned and ill-conditioned problems are 
used in the comparison, and the contrast between these two categories is highlighted. 

1 Introduction 

In recent years, applications of sparse methods in signal analysis and inverse problems 
have received a great deal of attention. The term compressed sensing is used to describe 
the ability to reconstruct a sparse signal or object from far fewer linear measurements 
than would be needed traditionally [1]. 

A promising approach, applicable to the regularization of linear inverse problems, 
consists of using a sparsity-promoting penalization. A particularly popular penalization 
of this type is the ii norm of the object in the basis or frame in which the object is assumed 
to be sparse. In [2] it was shown that adding an ii norm penalization to a least squares 
functional (see expression ([1]) below) regularizes ill-posed linear inverse problems . The 
minimizer of this functional has many components exactly equal to zero. Furthermore, 
the iterative soft-thresholding algorithm (1ST) was shown to converge in norm to the 
minimizer of this functional (earlier work on ii penalties is in [3l|4]). 

It has also been noted that the convergence of the 1ST algorithm can be rather slow 
in cases of practical importance, and research interest in speeding up the convergence 
or developing alternative algorithms is growing O [6l [71 [HI [9] . There are already several 
different algorithms for the minimization of an ii penalized functional. Therefore, it is 
necessary to discuss robust ways of evaluating and comparing the performance of these 
competing methods. The aim of this manuscript is to propose a procedure that assesses 
the strengths and weaknesses of these minimization algorithms for a range of penalty 
parameters. 

Often, authors compare algorithms only for a single value of the penalty parameter and 
may thereby fail to deliver a complete picture of the convergence speed of the algorithms. 
For the reader, it is difficult to know if the parameter has been tuned to favor one or the 
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other method. Another issue plagueing the comparison of different minimization algo- 
rithms for problem ([2]) below is the confusion that is made with sparse recovery. Finding 
a sparse solution of a linear equation and finding the minimizer of ^ are closely related 
but (importantly) different problems: The latter will be sparse (the higher the penalty 
parameter, the sparser), but the former does not necessarily minimize the £i penalized 
functional for any value of the penalty parameter. Contrary to many discussions in the 
literature, we will look at the minimization problem ([2]) independently of its connection 
to sparse recovery and compressed sensing. 

The central theme of this note is the introduction of the concept of approximation 
isochrone, and the illustration of its use in the comparison of different minimization algo- 
rithms. It proves to be an effective tool in revealing when algorithms do well and under 
which circumstances they fail. As an illustration, we compare five different iterative al- 
gorithms. For this we use a strongly ill-conditioned linear inverse problem that finds its 
origin in a problem of seismic tomography [lOj . a Gaussian random matrix (i.e. the ma- 
trix elements are taken from a normal distribution) as well as two additional synthetic 
matrices. In the existing literature, most tests are done using only a matrix of random 
numbers, but we believe that it is very important to also consider other matrices. Actual 
inverse problems may depend on an operator with a less well-behaved spectrum or with 
other properties that could make the minimization more difficult. Such tests are usually 
not available. Here we compare four operators. Among other things, we find that the 
strongly singular matrices are more demanding on the algorithms. 

We limit ourselves to the case of real signals, and do not consider complex variables. 
In this manuscript, the usual 2-norm of a vector x is denoted by ||x|| and the 1-norm is 
denoted by ||a;||i. The largest singular value of a matrix K is denoted by ||i^||. 



2 Problem statement 



After the introduction of a suitable basis or frame for the object and the image space, the 
minimization problem under study can be stated in its most basic form, without referring 
to any specific physical origin, as the minimization of the convex functional 

Fx{x) = \\Kx-yf + 2X\\x\\i (1) 

in a real vector space {x G MP, A > 0), for a fixed linear operator K £ W^^p and data 
y G M™ . In the present analysis we will assume the linear operator K and the data y are 
such that the minimizer of ([T]) is unique. This is a reasonable assumption as one imposes 
penalty terms, typically, to make the solution to an inverse problem unique. 
We set 

x(A) = arg min \\Kx — i/lp -|- 2A||x||i. (2) 

X 

The penalty parameter A is positive; in applications, it has to be chosen depending on the 
context. Problem ([2]) is equivalent to the constrained minimization problem: 

x{p) = aig min — ?/||^ (3) 

lklli<p 

with an implicit relationship between p and A: p = ||x(A)||i. It follows from the equations 
below that the inverse relationship is: A = maxj|(J^^(y — Kx{p)))i\. Under these 
conditions one has that: x{X) = x{p). One also has that x(A) = for all A > Amax = 
maxi \{K'^y)i\. 
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2.1 Direct method 



An important thing to note is that the minimizer x (and thus also x) can in principle be 
found in a finite number of steps using the homotopy/LARS method |1H I12j. This direct 
method starts from the variational equations which describe the minimizer x: 

{K'^{y - Kx))i = A sgn(xi) if Xj / , . 

\{KT{y-Kx))i\ < A if Xi = 

Because of the piece-wise linearity of the equations dH, it is possible to construct a;(Astop) 
by letting A in ([1]) descend from Amax to Astop, and by solving a linear system at every 
value of A where a component in x{\) goes from zero to nonzero or, exceptionally, from 
nonzero to zero. The first such point occurs at A = Amax- The linear systems that have to 
be solved at each of these breakpoints are 'small', starting from 1x1 and ending with s x s, 
where s is the number of nonzero components in x(Astop)- Such a method thus constructs 
x(A) exactly for all Amax > A > Astop, or equivalently all x with < ||x(A)||i < ||x(Astop)||i- 
It also follows that x(A) is a piecewise linear function of A. 

Implementations of this direct algorithm exist [13^ [T4l I15j . and exhibit a time com- 
plexity that is approximately cubic in s (the number of nonzero components of x{X)). If 
one is interested in sparse recovery, this is not necessarily a problem as time complexity 
is linear for small s. In fact, the algorithm can be quite fast, certainly if one weighs in 
the fact that the exact minimizer (up to computer round-off) is obtained. A plot of the 
time complexity as a function of the number of nonzero components in x(A) is given in 
figure [1] (left hand side) for an example operator i^^^^ and data y (see start of section H] 
for a description of K^^'^). The plot shows that the direct algorithm is useable in practice 
for about s < 10^. This graph also illustrates the fact that the size of the support of x(A) 
does not necessarily grow monotonically with decreasing penalty parameter (this depends 
on the operator and data). 

It follows immediately from equation ([4]) that the minimizer x{X) satisfies the fixed 
point equation 

x = Sx[x + K^{y-Kx)] (5) 
where Sx is the well-known soft-thresholding operator applied component-wise: 

u > A 

Sx{u) = { |n| < A (6) 

n < -A 

In real life we have to take into account that computers work with floating point, inexact, 
arithmetic. The definition of an exact solution, in this case, is that x{X) satisfies the 
fixed-point equation ([5]) up to computer precision: 

\\x - Sx[x + K^iy - i^x)]||/||x|| « 10-16. (7) 

The direct algorithm mentioned before can be implemented using fioating point arithmetic 
(|131ll4j do this). The implementation [15] can handle both exact arithmetic (with integer, 
rational numbers) and fioating point arithmetic. An example of the errors made by the 
direct method is illustrated figure [D (right). We will still use the term 'exact solution' as 
long as condition ([7]) is satisfied. 
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Figure 1: Left: Time complexity of the direct algorithm mentioned in section [2l Horizontal 
axis: number of nonzero components in the minimizer x. Vertical axis: time needed by 
the direct algorithm to calculate this minimizer (seconds). The continuous line represents 
a cubic fit. The support of x(A) may sometimes decrease for increasing values of ||x(A)||i. 
This phenomenon is clearly visible in the zoomed area. Right: The relative error, ||x — 
Sx{x + K^{y - Kx))\\/\\x\\, of the resulting minimizer at each step (there are more steps 
than nonzero components because of the phenomenon in the zoomed area in the left hand 
side plot). Due to floating point arithmetic, this is not exactly zero. These two pictures 
were made by letting A decrease from Amax to Agtop = Amax/2^^'^^^^. The operator used is 

2.2 Iterative algorithms 

There exist several iterative methods that can be used for the minimization problems ^ 
or dal): 

(a) The iterative soft-thresholding (1ST) algorithm already mentioned in the introduc- 
tion can be written as: 

^(n+l) ^ g^^^in) ^ j^T^y _ ^^(n))]^ ^(0) ^ q (g) 

Under the condition \\K\\ < 1 the limit of this sequence coincides with the minimizer 
x(A) and F\{x^'^^) decreases monotonically as a function of n [2]. For ||K|| < \/2, 
there is still (weak) convergence [161 Corollary 5.9], but the functional ([T|) is no 
longer guaranteed to decrease at every step. This algorithm is probably the easiest 
to implement. 

(b) a projected steepest descent method [5] (and related [iTl expression (59)]): 

^(n+l) ^ p^j^(n,) ^ ^(n) ^(n)]^ ^(0) ^ ^g) 

with r(") = K^{y - Kx^")) and P^""^ = Hr^*^) || Vl|i^r(") f . Pp{-) denotes the orthog- 
onal projection onto an ii ball of radius p, and can be implemented efficiently by 
soft-thresholding with an appropriate variable threshold. 

(c) the 'GPSR-algorithm' (gradient projection for sparse reconstruction), another iter- 
ative projection method, in the auxiliary variables u,v > with x = u — v [?]• 

(d) the '£i-ls-algorithm', an interior point method using preconditioned conjugate gra- 
dient substeps (this method solves a linear system in each outer iteration step) [6] . 
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(e) 'FISTA' (fast iterative soft-thresholding algorithm) is a variation of the iterative 
soft-thresholding algorithm. Define the (non- linear) operator T by T{x) = S\[x + 
K^{y — Kx)]. Then the FISTA algorithm is: 



X 



(n+i) = T ( + ^^J^ (xW - x("-i)) j = 0, (10) 



where = ^^V^+^i^'- ') ^^^^ ^(i) _ -|^_ ^las virtually the same complexity as 



algorithm (a), but can be shown to have better convergence properties [9]. 



2.3 Warm-start strategies 

There also exist so-called warm-start strategies. These methods start from x(Ao = Amax) = 
and try to approximate x{Xk) for k : 0, ... by starting from an approximation of 
x(Afc_i) already obtained in the previous step instead of always restarting from 0. They 
can be used for finding an approximation of a whole range of minimizers x{\k) for a set 
of penalty parameters Amax = Aq > Ai > A2 > • • • > Aat = Agtop or, equivalently, for a set 
of £i-radii = po<Pi<P2<---<P7V = Pstop- Two examples of such methods are: 

(A) 'fixed-point continuation' method [8]: 

= Sa„^, [x(") + K^{y - Kx<^^^)] (11) 

with Ao = Amax and A„+i = aA„ and a < 1 such that Xn = Agtop (after a pre- 
determined number of steps). In other words, the threshold is decreased geo- 
metrically instead of being fixed as in the 1ST method and x^"^ is interpreted as an 
approximation of x(An). 

(B) adaptive steepest descent [5]: 

^(n+l) ^ p^^^^ [^(n) ^ pin) j^T _ ^^(n))]^ ^(0) ^ ^^2) 

with r(") = K^{y - Kx^""^), = ||r(") || VHiTrW f , and pn+i = {n + l)pstop/N. 



Here the radius pn increases arithmetically instead of being fixed as in algorithm (b) 
and x^"^ is interpreted as an approximation of x(p,„). 

Such algorithms have the advantage of providing an approximation of the Pareto curve (a 
plot of ||Erx(A) — vs. ||x(A)||i, also known as trade-off curve) as they go, instead of 
just calculating the minimizer corresponding to a fixed penalty parameter. It is useful for 
determining a suitable value of the penalty parameter A in applications. 



3 Approximation isochrones 

In this section, we discuss the problem of assessing the speed of convergence of a given 
minimization algorithm. 

The minimization problem ([T]) is often used in compressed sensing. In this context, 
an iterative minimization algorithm may be tested as follows: one chooses a (random) 
sparse input vector x™p"*, calculates the image under the linear operator and adds noise 
y = i^x^P"* -|- n. One then uses the algorithm in question to try to reconstruct the input 
vector as a minimizer of ([2j), choosing A in such a way that the resulting x^'^^ best 
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corresponds to the input vector x'^^p"*. This procedure in not useful in our case because it 
does not compare the iterates x^") (n : . . . N) with the actual minimizer x of functional 
([2]). Even though it is sparse, the input vector most likely does not satisfy equations 

of type ((H), and hence does not constitute an actual minimizer of ([1]). Such a type of 
evaluation is e.g. done in [71 section IV. A]. 

In this note we are interested in describing how well an algorithm does in finding the 
true minimizer of ([1]), not in how suitable an algorithm may be for compressed sensing 
applications. We consider sparse recovery and ^i-penalized functional minimization as two 
separate issues. Here, we want to focus on the latter. 

Another unsatisfactory method of evaluating the convergence of a minimization algo- 
rithm is to look at the behavior of the functional Fx{x^"'^) as a function of n. For small 
penalties, it is quite possible that the minimum is almost reached by a vector x^^^ that is 
still quite far from the real minimizer x. 

Suppose one has developed an iterative algorithm for the minimization of the ii- 
penalized functional ([T]), i.e. a method for the computation of x{X) in expression ([2]). As 
we are interested in evaluating an algorithm's capabilities of minimizing the functional ([T]) , 
it is reasonable that one would compare the iterates x^"^ with the exact minimizer x{\). 
I.e. evaluation of the convergence speed should look at the quantity Hx^""^ — as a function 
of time. A direct procedure exists for calculating x(A) and thus it is quite straightforward 
to make such an analysis for a whole range of values of the penalty parameter A (as long 
as the support of x(A) is not excessively large). We will see that the weaknesses of the 
iterative algorithms are already observable for penalty parameters corresponding to quite 
sparse x(A), i.e. for x that are still relatively easy to compute with the direct method. 

In doing so, one has three parameters that need to be included in a graphical repre- 
sentation: the relative reconstruction error e = Hx*-"") — x||/||x||, the penalty parameter A 
and also the time t needed to obtain the approximation x^"^. Making a 3D plot is not 
a good option because, when printed on paper or displayed on screen, it is difficult to 
accurately interpret the shape of the surface. It is therefore advantageous to try to make 
a more condensed representation of the algorithm's outcome. One particularly revealing 
possibility we suggest, is to use the A-e-plane to plot the isochrones of the algorithm. 

For a fixed amount of computer time t, these isochrones trace out the approximation 
accuracy e that can be obtained for varying values of the penalty parameter A. There 
are two distinct advantages in doing so. Firstly, it becomes immediately clear for which 
range of A the algorithm in question converges quickly and, by labeling the isochrones, in 
which time frame. Secondly, it is clear where the algorithm has trouble approaching the 
real minimizer: this is characterized by isochrones that are very close to each other and 
away from the e = line. Hence a qualitative evaluation of the convergence properties can 
be made by noticing where the isochrones are broadly and uniformly spaced (good con- 
vergence properties), and where the isochrones seem to stick together (slow convergence). 
Quantitatively, one immediately sees what the remaining relative error is. 

Another advantage of this representation is that (in some weak sense) the isochrones 
do not depend on the computer used: if one uses a faster/slower computer to make the 
calculations, the isochrones 'do not change place' in the sense that only their labels change. 
In other words, this way of representing the convergence speed of the algorithm accurately 
depicts the region where convergence is fast or slow, independently of the computer used. 

In figure m an example of such a plot is given. The operator is again K^^^ S ]^i848x8i92 
(more details are given at the start of section H]). The algorithm assessed in this plot is 
the iterative thresholding algorithm (a). Clearly one sees that the iterative thresholding 



6 



<— more sparse 20 supp a; (A) 100 less sparse 

1 \ 

(a) iterative thresholding 




Figure 2: This figure displays the approximation isochrones {t = 1,2,..., 10 minutes) 
for the iterative thresholding algorithm ^ with operator K^^\ The vertical axis is e = 
\\x^^^ — The bottom horizontal axis represents log2 Amax/A (i.e. small A to the 

right). The top horizontal axis is used to indicate the number of nonzero components in the 
corresponding minimizer (nonlinear scale). Convergence is satisfactory for A > Amax/2^ 
but very slow for A < Amax/2^ 
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does well for A > Aniax/2^, but comes into trouble for A < Amax/^*^- This means that 
the iterative thresholding algorithm ^ has trouble finding the minimizer with more than 
about 50 nonzero components (out of a possible 8192 degrees of freedom). The direct 
method is still practical in this regime as proven by figured] (left), and it was used to find 
the 'exact' x(A)'s. In fact, here the direct method is faster than the iterative methods. 

4 Comparison of minimization algorithms 

In this section we will compare the six iterative algorithms (a)-(e), and the two warm- 
start algorithms (A)-(B) mentioned in section El We will use four qualitatively different 
operators K for making this comparison. 

Firstly, we will use an ill-conditioned matrix K^^'^ stemming from a geo-science inverse 
problem. It was already used in figures [1] and [2j It contains 1848 2-D integration kernels 
discretized on a 64 x 64 grid, and expanded in a (2x redundant) wavelet frame. Hence 
this matrix has 1848 rows and 8192 (= 2 x 64^) columns. The spectrum is pictured in 
figure [3l Clearly it is severely ill-conditioned. 

The matrix K^'^^ is of the same size, but contains random numbers taken from a 
Gaussian distribution. Its spectrum is also in figure [3] and has a much better condition 
number (ratio of largest singular value to smallest nonzero singular value). This type of 
operator is often used in the evaluation of algorithms for minimization of an £i-penalized 
functional. 

As we shall see, the different minimization algorithms will not perform equally well 
for these two operators. In order to further discuss the influence of both the spectrum 
and orientation of the null space of the operator on the algorithms' behavior, we will also 
use two other operators. K^^^ will have the same well behaved spectrum as K^'^\ but 
an unfavorably oriented null space: the null space will contain many directions that are 
almost parallel to an edge or a face of the ii ball. K^^^ will have the same ill-behaved 
spectrum as K^^^ , but it will have the same singular vectors as the Gaussian matrix K^'^^ . 

More precisely, K^^^ is constructed artificially from K^'^^ by setting columns 4000 
through 8192 equal to column 4000. This creates an operator with a null space that 
contains many vectors parallel to a side or edge of the ii ball. A small perturbation is 
added in the form of another random Gaussian matrix to yield the intermediate matrix A. 
The singular value decomposition is calculated A = USV'^ {U~^ = U'^ and = V^). 
In this decomposition, the spectrum is replaced by the spectrum of K^'^\ This then forms 
the matrix K^'^\ 

K^^^ is constructed by calculating the singular value decomposition of K^'^^ = U SV''" 
and replacing the singular values in S by those of K^^\ 

In all cases, K^^^ is normalized to have its largest singular value ~ 0.999 (except when 
studying algorithm (a') where we use normalization ~ 0.999 x \/2)- 

4.1 A severely ill-conditioned operator 

In figure m we compare the six algorithms (a)-(e) mentioned in section [2] for the same ill- 
conditioned operator K^^^ G j^i848x8i92 ^ -^^ figure[2l We again choose penalty parameters 
Amax > A > Ainax/2^'^ and show the isochrones corresponding to t = 1, 2, . . . , 10 minutes. 
Panel (a) is identical to figure [2j All six algorithms do well for large penalties (left-hand 
sides of the graphs). For smaller values of A (8 < log2Ainax/A) the isochrones come 
closer together meaning that convergence progresses very slowly for algorithms (a), (a'). 
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Figure 3: Left: The singular values A„ (n : 1, . . . , 1848) of the operators K^^^ and K^'^\ 
Right: The singular values (n : 1, . . . , 1848) of the operators i^T^^^ and K^^\ The 
Gaussian random matrix K^'^^ is much better conditioned than operator K^'^\ 

(b) and (c). For algorithm (d), the isochrones are still reasonable uniformly spaced even 
for smaller values of the penalty parameter. In this case the projected algorithms do 
better than iterative thresholding, but the £i-ls algorithm (d) is to be preferred in case of 
small penalty parameters. The Fista (e) algorithm seems to perform best of all for small 
parameters A. 

Apart from the shape of the isochrone curves, it is also important to appreciate the 
top horizontal scales of these plots. The top scale indicate the size of the support of the 
corresponding minimizer x{\). We see that all algorithms have much difficulty in finding 
minimizers with more than about 100 nonzero components. In this range of the number 
of nonzero components in x(A), the direct method is faster for K^^\ 

A skeptic might argue that, in the case of figures [2] and HI the minimizer x(A) might 
not be unique for small values of A, and that this is the reason why the isochrones do not 
tend to e = after about 10 minutes (~ 4500 iterations). This is not the case. If one runs 
the iterative methods for a much longer time, one sees that the error e does go to zero. 
Such a plot is made in figure [5] for one choice of A (A ~ Ainax/2^^'^^^)- 

What we notice here is that, for iterative soft-thresholding, Hx^""''^) — x^") || = 
- SaIx^") + K'^{y - Jf2;("))||/||x(")|| is small 10"^ after 4500 iterations in this 
example). But this does not mean that the algorithm has almost converged (as might be 
suggested by figure [5]- inset); on the contrary it indicates that the algorithm is progressing 
only very slowly for this value of A! The difference should be of the 

order 10~^^, for one to be able to conclude convergence, as already announced in formula 
©• 

4.2 A Gaussian random matrix 

In this subsection we choose K = K^'^\ It is much less ill-conditioned than the matrix in 
the previous subsection. 

In figure [6] we make the same comparison of the six iterative algorithms (a)-(e) for 
the operator K^'^\ We choose penalty parameters Amax > A > Amax/2^^ and show the 
isochrones corresponding to t = 6, 12, . . . , 60 seconds. I.e. the time scale is 10 times 
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Figure 4: These pictures contain the approximation isochrones for the algorithms (a)-(e) 
mentioned in section [2] for t = 1,2... 10 minutes. The horizontal and vertical axis are 
identical to the ones used in figure [2j The operator used is K^^\ Clearly, for this example, 
methods (a), (a'), (b) and (c) have a lot of difficulty approaching the minimizer for small 
values of A (closely spaced isochrones). The £i-ls method (d) still works well, but it is 
slower for large penalties. The Fista methods (e) appears to work best for small penalty 
parameters. The GPSR method works well for relatively large values of A. See text for a 
discussion. 
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Figure 5: Main graph: Relative error — as a function of time for one particular 

value of A. The operator is K^^\ Clearly, all algorithms converge to the same limit x, but 
some algorithms are very slow. Inset: zoom-in on the first 10 minutes for the iterative 
thresholding algorithm. It appears to indicate convergence (to a different minimizer), but 
this is deceptive as the main graph shows. Labels (a)-(e) refer to the algorithms mentioned 
in section [2] and figure [H 

smaller than for the ill-conditioned matrix K^^^ in the previous section. Again all the 
algorithms do reasonably well for large penalty parameters, but performance diminishes 
for smaller values of A. 

The iterative soft-thresholding method with HX^^^H ~ \/2 in (a') does slightly better 
than iterative soft-thresholding with ~ 1 in panel (a). The GPSR method (c) 

does better than the other methods for large values of the penalty (up to |suppa;| « 500), 
but loses out for smaller values. The ^i-ls method (d) does well compared to the other 
algorithms as long as the penalty parameter is not too large. The FISTA method (e) does 
best except for the very small value of A (right hand sides of plots) where the projected 
steepest descent does better in this time scale. 

Apart from the different time scales (1 minute vs. 10 minutes) there is another, prob- 
ably even more important, difference between the behavior of the algorithms for K^^"^ and 
. In the latter case the size of the support of the minimizers that are recoverable by 
the iterative algorithms range from to about 1800 (which is about the maximum for 
1848 data). This is much more than in the case of the matrix K^^^ in figure [5] where only 
minimizers with about 120 nonzero coefficients were recoverable. 

4.3 Further examples 

Here we compare the various iterative algorithms for the operators K^^^ and K^'^\ The 
former was constructed such that many elements of its null space are almost parallel to 
the edges of the ii ball, in an effort to make the minimization ([1]) more challenging. The 
latter operator has the same singular vectors as the random Gaussian matrix K^"^^ but 
the ill-conditioned spectrum of K^^h This will then illustrate how the spectrum of K can 
influence the algorithms used for solving ([T]). 

Figure \7\ shows the isochrone lines for the various algorithms applied to the operator 
K^^^ . To make comparison with figure [6] straightforward, the total time span is again 1 
minute subdivided in 6s intervals. Convergence first progresses faster than in figure [H 
the isochrone corresponding to t = 6s lies lower than in figure [6l For A < Aniax/2^ it is 
clear that the various algorithms perform worse for the operator K^^^ than for the matrix 
K^'^\ As these two operators have identical spectra, this implies that a well-conditioned 
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(b) projected steepest descent 




Figure 6: These pictures contain the approximation isochrones for the algorithms (a)-(e) 
mentioned in section [2] for t = 6, 12, . . . , 60 seconds. The operator used in this comparison 
is the Gaussian random matrix K^'^\ This matrix is much better conditioned than the 
matrix A'^^^ which was used for figures [H The main differences are faster convergence and, 
importantly, more nonzero components are recoverable (the top scales, indicating the size 
of the support of the minimizer, are much larger than in figures U]). 
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spectrum alone is no guarantee for good performance of the algorithms. 

Figure [7] shows the isochrone plots for the operator K^^^ which has identical spectrum 
as K^^\ implying that it is very ill-conditioned. The singular vectors of K^'^^ are the 
same as for the random Gaussian matrix K^'^\ We see that the ill-conditioning of the 
spectrum influences the performance of the algorithms is a negative way, as compared to 
the Gaussian operator K^'^\ 

4.4 Warm-start strategies 

For the warm-start algorithms (A)-(B), it is not possible to draw isochrones because these 
methods are not purely iterative. They depend on a preset maximum number of iterates 
and a preset end value for the penalty parameter A or the ii norm p of x. It is possible, 
for a fixed total computation time and a fixed value of Astop or pstop, to plot ||2;(A„) — x^") || 
vs. A.„ or ||2;(y0n) — x^""^ \\ vs. /)„. This gives a condensed picture of the performance of such 
an algorithm, as it include information on the remaining error for various values of A or p. 

In figure m the warm-start methods (A) and (B) are compared in the range < < 
15 for the matrix K^^\ For each experiment, ten runs were performed, corresponding 
to total computation times of 1,2, ...,10 minutes. For each run, the parameter pmax 
was chosen to be 15 in algorithm (B). For algorithm (A), Astop was chosen such that 
||aJ(Astop)||i = 15 also. From the pictures we see that the algorithms (A)-(B) do not 
do very well for small values of the penalty parameter A. Their big advantage is clear: 
For the price of a single run, an acceptable approximation of all minimizers x(A) with 
Amax > A > Astop is obtained. The algorithm (B) does somewhat better than algorithm 
(A). 

In figure [TOl the same type of comparison is made for the matrix K^'^\ In this case, ten 
runs are performed with a total computation time per run equal to 6, 12, ... , 60 seconds 
(60s corresponds to about 460 iterative soft-thresholding steps or 300 projected steepest 
descent steps). This is ten times less than in case of the matrix K^^\ Both the 'fixed 
point continuation method' (A) and the 'projected steepest descent' (B) do acceptably 
well, even up to very small values for the penalty parameter A (large value of p). 

All the calculations in this note were done on a 2GIIz processor and 2Gb ram memory. 

5 Conclusions 

The problem of assessing the convergence properties for ii penalized least squares func- 
tional was discussed. 

We start from the rather obvious observation that convergence speed can only refer to 
the behavior of e = \\x^"'^ — as a function of time. Luckily, in the case of functional 

([T]), the exact minimizer x (up to computer round-off) can be found in a finite number 
of steps: even though the variational equations ^ are nonlinear, they can still be solved 
exactly using the L ARS/homotopy method. A direct calculation of the minimizers x(A) 
is thus possible, at least when the number of nonzero coefficients in x is not too large (e.g. 
in the numerical experiments in [7], one has 4096 degrees of freedom and only about 160 
nonzeros.). 

We provided a graph that indicates the time complexity of the exact algorithm as a 
function of s, the number of nonzero components in x(A). It showed us that computing 
time rises approximately cubically as a function of s (it is linear at first). Also we gave an 
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Figure 7: The six algorithms (a)-(e) are compared for the operator K^'^\ The time 
intervals are again 6, 12, . . . , 60s. For A < Aniax/2®, the algorithms performs worse than in 
the case of matrix K^'^^ in figure El although K^'^^ and K^^^ have the same singular values. 
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Figure 8: The six algorithms (a)-(e) are compared in case of the operator K^^\ The 
time intervals are 1,2,..., 10m as opposed to 6, . . . , 60s in figure El Clearly, replacing 
the spectrum of the random Gaussian matrix K^'^^ by the spectrum of K^^\ and leaving 
the singular vectors unaltered, changes the behavior of the minimization algorithms in a 
negative way. Convergence is much slower than in figure [6] for all algorithms. 
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Figure 9: The behavior of two warm-start strategies for the operator A'(^). The labels 
(A)~(B) refer to the list in section [2j In each picture the top line corresponds to a total 
calculation time of 1 minute. The bottom line corresponds to 10 minutes total calculation 
time (this is about 4800 iterative soft-thresholding steps or 3000 projected steepest descent 
steps). The lines in these plots are not isochrones, as these algorithms are not purely 
iterative, but depend on a preset stopping radius Pstop or stopping penalty Agtop- Pstop 
equals 15 in this case. 
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Figure 10: The behavior of two warm-start strategies for the Gaussian random matrix 
K^'^\ In each picture the top line corresponds to a total calculation time of 6 seconds. 
The bottom line to 1 minute. In this case we chose to stop at Agtop equals 
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example where the size of support of x(A) does not decrease monotonically as A decreases. 
The direct method is certainly practical for | suppa;(A)| < 10^ or so. 

It is impossible to completely characterize the performance of an iterative algorithm 
in just a single picture. A good qualitative appreciation, however, can be gained from 
looking at the approximation isochrones introduced in this note. These lines in the A — e- 
plane tell us for which value of the penalty parameter A convergence is adequately fast, 
and for which values it is inacceptably slow. We look at the region e G [0, 1] , because this 
is probably most interesting for doing real inversions. One could also use a logarithmic 
scale for e, and look at very small values of e approaching computer epsilon. But that is 
probably not of principle interest to people doing real inverse problems. The main content 
is thus in the concept of approximation isochrone and in figures EJ [H [6l [7] and [8] comparing 
six different algorithms for four operators. 

For large penalty parameters, all algorithms mentioned in this note do well for our 
particular example operators with a small preference for the GPSR method. The biggest 
difference can be found for small penalty parameters. Algorithms (a), (a'), (b) and (c) 
risk to be useless in this case. The ^i-ls (d) algorithm seems to be more robust, but it 
loses out for large penalties. The FISTA method (e) appears to work best for small values 
of the penalty parameter. 

We uncovered two aspects that may influence the convergence speed of the iterative 
algorithms. Firstly we saw that convergence is slower for ill-conditioned operators than 
for well-conditioned operators (comparison of K^'^'^ and K^^'^). Furthermore, the number 
of nonzero coefficients that a recoverable minimizer has is smaller for an ill-conditioned 
operator. In particular, the operator K^^^ that comes out of a real physical problem, 
presents a challenge for all algorithms for small values of the penalty parameter. Secondly, 
the orientation of the null-space with respect to the edges of the £i ball also influences the 
speed of convergence of the iterative algorithms, even if the operator is well-conditioned 
(comparison of K^"^^ and K^'^^). 

We also compared two warm-start strategies and showed that their main advantage is 
to yield a whole set of minimizers (for different penalty parameters) in a single run. We 
found that the adaptive-steepest descent method (B), proposed in [5] but tested here for 
the first time, does better than the fixed point continuation method (A). 
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